# Framing

So here's the deal: you could take this notebook, read through it, run it, and turn in its output, and that would meet expectations for this week. Of course, my hope is that you'll do more than this, but it's important for you to know that you don't have to unless you want to exceed expectations. So I'm attaching a stretch goal to incentivize a little more than meeting expectations: create a model that gets an F1 score in excess of 0.7 on the challenge data. The point is, if this doesn't seem doable for you, I don't want you spending too much time on it.
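In case the F1 score is new to you, here is a minimal sketch of how it is computed: it is the harmonic mean of precision and recall. The labels and predictions below are made up for illustration and have nothing to do with the challenge data.

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count true positives, false positives, and false negatives by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # how many predicted yeses were right
recall = tp / (tp + fn)     # how many actual yeses were found

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)
print(f1_manual, f1_sklearn)
```

Because it is a harmonic mean, F1 is pulled toward whichever of precision or recall is lower, so a model has to do reasonably well on both to clear 0.7.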


# Load Libraries

You don't need to worry about reading through the code in this first cell. It's just us loading libraries and functions. In the training notebook, these were spread out, but here we've moved them all up to the front. 

In [None]:
import pandas as pd

# I'll need these libraries to make it work
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.metrics import roc_curve, auc, precision_recall_curve
import matplotlib.pyplot as plt

# display() is used further down, so import it explicitly alongside HTML
from IPython.display import HTML, display
    
def evaluate(prob, pred, labels_test, verbose=1):
    acc = accuracy_score(labels_test,pred)
    ones = sum(labels_test)/len(labels_test)
    zeros = (1 - ones)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    recall_ = tp / (tp + fn) 
    precision_ = tp / (tp + fp)
    f1 = (2 / ((1/recall_)+(1/precision_)))

    false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,prob)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    
    if verbose==1:
        
        plt.rcParams['figure.figsize'] = [14, 4]
        
        plt.subplot(1, 3, 1)
        plt.title('Receiver Operating Characteristic')
        plt.plot(false_positive_rate, true_positive_rate, 'b',
        label='AUC = %0.2f'% roc_auc)
        plt.legend(loc='lower right')
        plt.plot([0,1],[0,1],'r--')
        plt.xlim([-0.1,1.2])
        plt.ylim([-0.1,1.2])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')

        precision, recall, thresholds = precision_recall_curve(labels_test,prob)

        plt.subplot(1, 3, 2)
        plt.title('Precision-Recall Curve')
        plt.plot(recall, precision,'b',
        label='Mean precision = %0.2f'% precision.mean())
        plt.legend(loc='lower right')
        plt.xlim([-0.1,1.2])
        plt.ylim([-0.1,1.2])
        # recall is plotted on the x-axis and precision on the y-axis
        plt.ylabel('Precision')
        plt.xlabel('Recall')

        plt.subplot(1, 3, 3)
        plt.hist([labels_test,prob], label=["true","pred"], bins='auto')  # arguments are passed to np.histogram
        plt.title("Predictions vs Reality")
        plt.legend(loc='best')
        plt.xlim([-0.1,1.2])

        plt.tight_layout()
        plt.show()    

        print ("\n0s: %0.2f\tTrue Positives: %s\tAccuracy: %s "%(zeros,tp,acc))
        print ("1s: %0.2f\tTrue Negatives: %s\tAUC: %s"%(ones,tn,roc_auc))
        print ("\t\tFalse Positives: %s\tF1 Score: %s"%(fp,f1))
        print ("\t\tFalse Negatives: %s\tRecall (fract of actual yeas found): %s"%(fn,recall_))
        print ("\t\t\t\t\tPrecision (correctness of yeas predicted): %s\n"%precision_)        
        
        null = zeros
        if null < ones:
            null = ones
        
        if (acc>null) and (roc_auc>0.5) and (recall_>0.5) and (precision_>0.5):
            print("################")
            print("#     PASS     #")
            print("################")
            vpass = "Yes"
        else:
            print("################")
            print("#     FAIL     #")
            print("################")
            vpass = "No"

# Load your labeled data (people.csv and calls.csv)

Broadly speaking, what you need to do this week is load the labeled data, clean it, train a model, load the challenge data, clean it, then run it through the model you trained before. The labeled and challenge data have the same format, except the challenge data doesn't have a _take_ column. That means you should be performing the same cleaning steps for both sets of data. Earlier in discussion with folks, I suggested that folks just copy their cleaning steps, but to make it easier in this notebook, I'm just going to ask you to come back up to this part of the notebook and run a different set of data through the steps.

Anywho, the way we'll make this work is by running through this notebook multiple times. The first time through you'll load the labeled data, clean it, and train your models. The second time through you'll load the challenge data, pass it through the model you trained, and collect its output.

This next cell prompts you to say whether you're doing the first run-through (training) or the second (prediction). Based on your answer, it will load either the labeled data or the challenge data. If you're doing the former, it will also ask whether you want to limit the size of your training data to make things run faster.

After loading your data, this cell will deal with datatypes and make sure that the dataframes use a common column name for the person_id.

In [None]:
# Normalize the answer so "T" or " t " is treated the same as "t" everywhere below
pass_through = input("Training or Predicting? (t or p): ").lower().strip()

if pass_through == "t":
    df_1 = pd.read_csv('people.csv', parse_dates=[4]) 
    df_2 = pd.read_csv('calls.csv', parse_dates=[11,12])
    print("\nIf you'd like to limit the number of rows loaded, provide a row limit. Otherwise leave the prompt blank.")
    limit_rows = input("Row limit?")        
    if limit_rows != '':
        limit_rows = int(limit_rows)
else:
    df_1 = pd.read_csv('challenge_people.csv', parse_dates=[4])    
    df_2 = pd.read_csv('challenge_calls.csv', parse_dates=[11,12])

print()
    
df_1["date of birth"] = pd.to_datetime(df_1["date of birth"], errors='coerce')
print(df_1.dtypes)
display(df_1.head())

df_2 = df_2.rename(columns={'person_ID': 'person_id'})
print(df_2.dtypes)
df_2.head()
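If you're wondering what `errors='coerce'` does in the cell above, here is a small sketch. The values are hypothetical stand-ins, not rows from `people.csv`.

```python
import pandas as pd

# Hypothetical values standing in for the "date of birth" column
raw = pd.Series(["1980-05-17", "not a date", None])

# errors="coerce" turns anything that can't be parsed into NaT
# (the datetime version of NaN) instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)
print("unparseable values:", parsed.isna().sum())
```

The upside is that one bad value won't crash the load; the downside is that bad values fail silently, so it's worth checking how many NaT values you ended up with.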

# Merge Your Data, Clean Your Data, and Build Features

Here we'll combine our dataframes together into a single dataframe we will call `single_table`.

Note: If you want to exceed expectations, this is where you would do it by adding or subtracting features. I've created a few features, but the performance of this model isn't very good. To pass an F1 of 0.7, you'll need to think more about what features really need to be here.

In [None]:
single_table = df_2.merge(df_1, how="left", on="person_id") 
single_table.head()

Just to get a feel for the column names, let's print them out. 

In [None]:
single_table.columns

You'll notice that the _Unnamed: 0_ column was duplicated, so we'll need to get rid of the `_x` and `_y` columns later on.
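Here is a toy sketch of where those `_x` and `_y` suffixes come from. The two small dataframes are made up; only the column names `Unnamed: 0`, `call_id`, `person_id`, and `sex` mirror the ones in this notebook.

```python
import pandas as pd

# Made-up stand-ins for the calls and people tables
calls = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "call_id": [10, 11, 12],
    "person_id": [1, 1, 2],
})
people = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "person_id": [1, 2],
    "sex": ["F", "M"],
})

# Both tables have an "Unnamed: 0" column that isn't the merge key,
# so pandas keeps both and tags them _x (left) and _y (right)
merged = calls.merge(people, how="left", on="person_id")
print(merged.columns.tolist())
```

With `how="left"`, every row of the left table (the calls) is kept, and the matching person's columns are attached to it.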

Let's deal with mixed-case values by making all of the string columns uppercase.

In [None]:
single_table['Referal Soure'] = single_table['Referal Soure'].str.upper()
single_table['body_part_1'] = single_table['body_part_1'].str.upper()
single_table['body_part_2'] = single_table['body_part_2'].str.upper()
single_table['body_part_3'] = single_table['body_part_3'].str.upper()
single_table['body_part_4'] = single_table['body_part_4'].str.upper()
single_table['body_part_5'] = single_table['body_part_5'].str.upper()
single_table['surgery'] = single_table['surgery'].str.upper()
single_table['sex'] = single_table['sex'].str.upper()

# If this is the training pass, we'll want to make sure the take column is uppercase as well. 
if 'take' in single_table.columns:
    single_table['take'] = single_table['take'].str.upper()

single_table.head()

We want to turn our surgery column into numbers. First we need to make sure we understand what's in the column.

In [None]:
single_table['surgery'].unique()

Great, just NO and YES values. So let's turn the NOs to 0 and the YESes to 1.

In [None]:
single_table.loc[single_table['surgery'] == "YES", 'surgery'] = 1
single_table.loc[single_table['surgery'] == "NO", 'surgery'] = 0
single_table.head()
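As an aside, the same YES/NO conversion can be written with `Series.map`. This is just an alternative sketch on made-up values, not a replacement for the cell above.

```python
import pandas as pd

# Made-up values standing in for the surgery column
surgery = pd.Series(["YES", "NO", "NO", "YES"])

# map() looks each value up in the dictionary;
# anything not in the dictionary would become NaN
encoded = surgery.map({"YES": 1, "NO": 0})
print(encoded.tolist())
```

One difference worth knowing: `map` returns a new numeric Series, while assigning through `.loc` changes values in place and leaves the column's dtype as object until it is converted later.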

We should do the same thing with the take column, assuming we're on the training pass where single_table has a take column.

In [None]:
if 'take' in single_table.columns:
    
    print(single_table["take"].unique())
    
    single_table.loc[single_table['take'] == 'YES', 'take'] = 1
    single_table.loc[single_table['take'] == 'Y', 'take'] = 1
    single_table.loc[single_table['take'] != 1, 'take'] = 0
    
    display(single_table.head())

I showed you this code in class, but here it is for you to use. What's going on here is that we're creating dummy variables for each body part column and then grouping them all together. **This could take a few minutes.**

In [None]:
# note this cell may take a while to run

parts = pd.get_dummies(single_table['body_part_1'])
parts = pd.concat([parts, pd.get_dummies(single_table['body_part_2'])], axis=1)
parts = pd.concat([parts, pd.get_dummies(single_table['body_part_3'])], axis=1)
parts = pd.concat([parts, pd.get_dummies(single_table['body_part_4'])], axis=1)
parts = pd.concat([parts, pd.get_dummies(single_table['body_part_5'])], axis=1)
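# The same body part can show up in several body_part columns, so collapse
# duplicate column names into a single 0/1 column per body part.
# (Grouping on the transpose avoids groupby's axis=1 argument, which newer pandas deprecates.)
parts = parts.T.groupby(level=0).any().T.astype(int)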
parts.head()

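To see what the dummy-variable step is doing, here is a toy sketch with two body part columns and three made-up rows. It groups the duplicated column names through a transpose, which as far as I know gives the same result as grouping along the columns directly.

```python
import pandas as pd

# Made-up rows; the real table has body_part_1 through body_part_5
toy = pd.DataFrame({
    "body_part_1": ["KNEE", "BACK", "NECK"],
    "body_part_2": ["BACK", None, "KNEE"],
})

# One set of dummy columns per body part column, placed side by side.
# This leaves duplicate column names (BACK and KNEE appear twice).
dummies = pd.concat(
    [pd.get_dummies(toy["body_part_1"]), pd.get_dummies(toy["body_part_2"])],
    axis=1,
)

# Collapse duplicate names: a body part is 1 if it appears in any column
combined = dummies.T.groupby(level=0).any().T.astype(int)
print(combined)
```

The first row ends up with a 1 under both BACK and KNEE, no matter which body part column each one came from.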

We can then take this set of new columns and add them to the existing table by using concat.

In [None]:
single_table = pd.concat([single_table, parts], axis=1)
single_table.head()

Now we've gotten to the spot where we drop the columns we don't want to consider. You'll note that I didn't remove the call_id. That was on purpose. After this cell, we'll make sure everything is numbers and get rid of the rows that contain non-numeric values. Only then will we pull out the call_ids, because if you're in your second run-through (prediction) we'll need to match those IDs up with your predictions, and we need to make sure that we aren't changing the row order after we pull out that list. That is, we make sure that the row order is fixed before we capture a list of the IDs.

In [None]:
# axis=1 must be passed by keyword; newer pandas versions reject it as a positional argument
df_num = single_table.drop([
    'Unnamed: 0_x',
    'Unnamed: 0_y',
    'name',
    'Referal Soure',
    'body_part_1',
    'body_part_2',
    'body_part_3',
    'body_part_4',
    'body_part_5',
    'sex',
    'attorney',
    'injury_date',
    'intake',
    'date of birth',
    'person_id',
], axis=1).copy()
print("row count:",len(df_num),"\n")

print(df_num.dtypes)
df_num.head()

And let's just make sure that everything in this table is a number.

In [None]:
df_num = df_num.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these (NaNs) like so.
print("row count before drop:",len(df_num))
df_num = df_num.dropna()
print("row count after drop:",len(df_num))
display(df_num.head())
df_num.dtypes
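Here is a small sketch of what that coerce-then-drop step does to a table. The `age` column and its values are hypothetical; they are only there to show a value that can't be converted.

```python
import pandas as pd

# Made-up table with one value that isn't a number
toy = pd.DataFrame({"age": ["34", "unknown", "51"], "surgery": [1, 0, 1]})

# Anything that can't be converted becomes NaN...
numeric = toy.apply(pd.to_numeric, errors="coerce")

# ...and dropna() then removes every row that contains a NaN
cleaned = numeric.dropna()
print("rows before:", len(numeric), "rows after:", len(cleaned))
print(cleaned)
```

Notice that a single bad value costs you the whole row. If the row count drops a lot in your own run, it may be worth finding out which column is responsible before accepting the loss.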

Here's where we pull out the call_ids. First we save them in their own variable (a pandas Series), then we remove the column from df_num.

In [None]:
call_ids = df_num["call_id"]

df_num = df_num.drop(['call_id'], axis=1)

In [None]:
if pass_through.lower().strip() == "t":
    d = 1
else:
    d = 0
    print(      "\n==============================\n")
    display(HTML('  You have already trained\n a model(s). Jump to: <a href="#Making-Predictions">Making Predictions</a> and run its cells.'))
    print(       "\n==============================\n")
    
# Intentional: when d is 0 this raises a ZeroDivisionError, which stops a "Run All" at this cell.
1/d

# Training

Create class_df from df_num. If you set a row limit above, limit class_df to the number of rows you set, otherwise do nothing.  

In [None]:
if limit_rows != '':
    class_df = df_num.sample(n=limit_rows)
else:
    class_df = df_num

Break your dataframe into training and testing data. 

In [None]:
# hold out a random 90% of the rows for testing
class_holdout = class_df.sample(frac=0.90)

# the training dataframe contains the remaining rows (everything not in the holdout)
class_training = class_df.loc[~class_df.index.isin(class_holdout.index)]

# make a training dataframe containing just features
features_train = class_training.drop("take", axis=1).values

# make a training dataframe containing only the target
labels_train = class_training["take"].values

# make a testing/holdout dataframe containing just features
features_test = class_holdout.drop("take", axis=1).values

# make a testing/holdout dataframe containing only the target
labels_test = class_holdout["take"].values

print("size of training",len(class_training))
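If you'd rather not split by hand, scikit-learn's `train_test_split` does the same job. This is a sketch on a made-up dataframe; the `test_size` and `random_state` values are arbitrary choices for the example, not the settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: one feature column and a balanced take column
toy = pd.DataFrame({"feature": range(20), "take": [0, 1] * 10})

# stratify keeps the share of 0s and 1s roughly the same in both pieces;
# random_state makes the split repeatable from run to run
train_df, holdout_df = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["take"]
)
print(len(train_df), len(holdout_df))
```

A repeatable split is handy when you're comparing feature sets, because a change in score then comes from your features rather than from a different random sample.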

Train your models.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept = False, C = 1e9)
clf_1 = model.fit(features_train, labels_train)
pred = clf_1.predict(features_test)
prob = clf_1.predict_proba(features_test)[:,1] 
print("Logistic Regression")
evaluate(prob, pred, labels_test)  

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_3 = RandomForestClassifier()
clf_3 = clf_3.fit(features_train, labels_train)
pred = clf_3.predict(features_test)
prob = clf_3.predict_proba(features_test)[:,1] 
print("Random Forest")
evaluate(prob,pred, labels_test)  

In [None]:
if pass_through.lower().strip() == "t":
    d = 0
    print("\n==============================\n")
    display(HTML('   Before you make predictions\n Go back and rerun the cells above\n from <a href="#Load-your-labeled-data-(people.csv-and-calls.csv)">Load-your-labeled-data</a> through Training'))
    print("\n==============================\n")
else:
    d = 1
    
# Intentional: when d is 0 this raises a ZeroDivisionError, which stops a "Run All" at this cell.
1/d

## Making Predictions 

Set clf to your best model above, and set inputs equal to the values in df_num. Keep in mind that since you only run this cell during your second pass-through (prediction), df_num is now the challenge data.

In [None]:
# What classifier do you want to use? Above, each one is 
# given its own number (e.g., clf_1, clf_2, etc.)
clf = clf_3

# Use every row of df_num (the challenge data on this pass) as model input
inputs = df_num.values
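One thing to watch for: the dummy columns depend on which body parts actually appear in the data, so the challenge data may not produce exactly the same columns, in the same order, as the training data did. Here is a sketch of one way to line them up with `reindex`. The two dataframes and the names `train_features` and `challenge_features` are hypothetical; the notebook as written doesn't save the training column list, so you would need to capture it during your training pass.

```python
import pandas as pd

# Hypothetical feature tables with mismatched columns
train_features = pd.DataFrame({"surgery": [1, 0], "BACK": [1, 0], "KNEE": [0, 1]})
challenge_features = pd.DataFrame({"KNEE": [1, 0], "surgery": [0, 1], "ELBOW": [0, 1]})

# Put the challenge columns in the training order:
# columns missing from the challenge data are filled with 0,
# and columns the model never saw (ELBOW) are dropped
aligned = challenge_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```

A model trained on arrays only knows features by position, so if the columns don't match it will either refuse to predict or, worse, quietly read the wrong column.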

Create a new dataframe with your predictions and the call IDs.

In [None]:
prediction_list = pd.DataFrame(
    {
        'call_id': call_ids, 
        'take': clf.predict(inputs)
    })
prediction_list.head()

Filter this new dataframe such that it only includes those with a predicted take and show us the call_ids for those rows. 

In [None]:
prediction_list[prediction_list["take"]==1]["call_id"].values